Clustering with t-SNE, provably

نویسندگان

  • George C. Linderman
  • Stefan Steinerberger
چکیده

t-distributed Stochastic Neighborhood Embedding (t-SNE), a clustering and visualization method proposed by van der Maaten & Hinton in 2008, has rapidly become a standard tool in a number of natural sciences. Despite its overwhelming success, there is a distinct lack of mathematical foundations and the inner workings of the algorithm are not well understood. The purpose of this paper is to prove that t-SNE is able to recover well-separated clusters; more precisely, we prove that t-SNE in the ‘early exaggeration’ phase, an optimization technique proposed by van der Maaten & Hinton (2008) and van der Maaten (2014), can be rigorously analyzed. As a byproduct, the proof suggests novel ways for setting the exaggeration parameter α and step size h. Numerical examples illustrate the effectiveness of these rules: in particular, the quality of embedding of topological structures (e.g. the swiss roll) improves. We also discuss a connection to spectral clustering methods.

برای دانلود رایگان متن کامل این مقاله و بیش از 32 میلیون مقاله دیگر ابتدا ثبت نام کنید

ثبت نام

اگر عضو سایت هستید لطفا وارد حساب کاربری خود شوید

منابع مشابه

An Analysis of the t-SNE Algorithm for Data Visualization

A first line of attack in exploratory data analysis is data visualization, i.e., generating a 2-dimensional representation of data that makes clusters of similar points visually identifiable. Standard JohnsonLindenstrauss dimensionality reduction does not produce data visualizations. The t-SNE heuristic of van der Maaten and Hinton, which is based on non-convex optimization, has become the de f...

متن کامل

Multiple‐cumulative probabilities used to cluster and visualize transcriptomes

Analysis of gene expression data by clustering and visualizing played a central role in obtaining biological knowledge. Here, we used Pearson's correlation coefficient of multiple-cumulative probabilities (PCC-MCP) of genes to define the similarity of gene expression behaviors. To answer the challenge of the high-dimensional MCPs, we used icc-cluster, a clustering algorithm that obtained soluti...

متن کامل

Visualization and Automatic Typology Construction of Pottery Profiles

Since the second half of the last century, computers became available to archaeologists. Since then, many have used various multivariate analysis techniques in the classification of archaeological objects with more or less success. Over the last decades, the popularity of multivariate analysis seems to have seized, partly because of the frequently disappointing performance of techniques such as...

متن کامل

Efficient Algorithms for t-distributed Stochastic Neighborhood Embedding

Abstract. t-distributed Stochastic Neighborhood Embedding (t-SNE) is a method for dimensionality reduction and visualization that has become widely popular in recent years. Efficient implementations of t-SNE are available, but they scale poorly to datasets with hundreds of thousands to millions of high dimensional data-points. We present Fast Fourier Transformaccelerated Interpolation-based t-S...

متن کامل

Problems with small area surveys: lensing covariance of supernova distance measurements.

While luminosity distances from type Ia supernovae (SNe) are a powerful probe of cosmology, the accuracy with which these distances can be measured is limited by cosmic magnification due to gravitational lensing by the intervening large-scale structure. Spatial clustering of foreground mass leads to correlated errors in SNe distances. By including the full covariance matrix of SNe, we show that...

متن کامل

ذخیره در منابع من


  با ذخیره ی این منبع در منابع من، دسترسی به آن را برای استفاده های بعدی آسان تر کنید

عنوان ژورنال:
  • CoRR

دوره abs/1706.02582  شماره 

صفحات  -

تاریخ انتشار 2017